29 research outputs found
Gazelle: A Low Latency Framework for Secure Neural Network Inference
The growing popularity of cloud-based machine learning raises a natural
question about the privacy guarantees that can be provided in such a setting.
Our work tackles this problem in the context where a client wishes to classify
private images using a convolutional neural network (CNN) trained by a server.
Our goal is to build efficient protocols whereby the client can acquire the
classification result without revealing their input to the server, while
guaranteeing the privacy of the server's neural network.
To this end, we design Gazelle, a scalable and low-latency system for secure
neural network inference, using an intricate combination of homomorphic
encryption and traditional two-party computation techniques (such as garbled
circuits). Gazelle makes three contributions. First, we design the Gazelle
homomorphic encryption library which provides fast algorithms for basic
homomorphic operations such as SIMD (single instruction multiple data)
addition, SIMD multiplication and ciphertext permutation. Second, we implement
the Gazelle homomorphic linear algebra kernels which map neural network layers
to optimized homomorphic matrix-vector multiplication and convolution routines.
Third, we design optimized encryption switching protocols which seamlessly
convert between homomorphic and garbled circuit encodings to enable
implementation of complete neural network inference.
We evaluate our protocols on benchmark neural networks trained on the MNIST
and CIFAR-10 datasets and show that Gazelle outperforms the best existing
systems such as MiniONN (ACM CCS 2017) by 20 times and Chameleon (Crypto Eprint
2017/1164) by 30 times in online runtime. Similarly when compared with fully
homomorphic approaches like CryptoNets (ICML 2016) we demonstrate three orders
of magnitude faster online run-time
An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for End-to-End Security in IoT Applications
This paper presents a reconfigurable cryptographic engine that implements the
DTLS protocol to enable end-to-end security for IoT. This implementation of the
DTLS engine demonstrates 10x reduction in code size and 438x improvement in
energy-efficiency over software. Our ECC primitive is 237x and 9x more
energy-efficient compared to software and state-of-the-art hardware
respectively. Pairing the DTLS engine with an on-chip RISC-V allows us to
demonstrate applications beyond DTLS with up to 2 orders of magnitude energy
savings.Comment: Published in 2018 IEEE International Solid-State Circuits Conference
(ISSCC
An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for End-to-End Security in IoT Applications
This paper presents a reconfigurable cryptographic engine that implements the
DTLS protocol to enable end-to-end security for IoT. This implementation of the
DTLS engine demonstrates 10x reduction in code size and 438x improvement in
energy-efficiency over software. Our ECC primitive is 237x and 9x more
energy-efficient compared to software and state-of-the-art hardware
respectively. Pairing the DTLS engine with an on-chip RISC-V allows us to
demonstrate applications beyond DTLS with up to 2 orders of magnitude energy
savings.Comment: Published in 2018 IEEE International Solid-State Circuits Conference
(ISSCC
Fast Vector Oblivious Linear Evaluation from Ring Learning with Errors
Oblivious linear evaluation (OLE) is a fundamental building block in multi-party computation protocols. In OLE, a sender holds a description of an affine function , the receiver holds an input , and gets (where all computations are done over some field, or more generally, a ring). Vector OLE (VOLE) is a generalization where the sender has many affine functions and the receiver learns the evaluation of all of these functions on a single point .
The state-of-the-art semi-honest VOLE protocols generally fall into two groups. The first group relies on standard assumptions to achieve security but lacks in concrete efficiency. These constructions are mostly based on additively homomorphic encryption (AHE) and are classified as ``folklore . The second group relies on less standard assumptions, usually properties of sparse, random linear codes, but they manage to achieve concrete practical efficiency.
In this work, we present a conceptually simple VOLE protocol that derives its security from a standard assumption, namely Ring Learning with Errors (RLWE), while still achieving concrete efficiency comparable to the fastest VOLE protocols from non-standard coding assumptions. Furthermore, our protocol admits a natural extension to batch OLE (BOLE), which is yet another variant of OLE that computes many OLEs in parallel
A 249-Mpixel/s HEVC Video-Decoder Chip for 4K Ultra-HD Applications
High Efficiency Video Coding, the latest video standard, uses larger and variable-sized coding units and longer interpolation filters than [H.264 over AVC] to better exploit redundancy in video signals. These algorithmic techniques enable a 50% decrease in bitrate at the cost of computational complexity, external memory bandwidth, and, for ASIC implementations, on-chip SRAM of the video codec. This paper describes architectural optimizations for an HEVC video decoder chip. The chip uses a two-stage subpipelining scheme to reduce on-chip SRAM by 56 kbytes-a 32% reduction. A high-throughput read-only cache combined with DRAM-latency-aware memory mapping reduces DRAM bandwidth by 67%. The chip is built for HEVC Working Draft 4 Low Complexity configuration and occupies 1.77 mm[superscript 2] in 40-nm CMOS. It performs 4K Ultra HD 30-fps video decoding at 200 MHz while consuming 1.19 [nJ over pixel] of normalized system power.Texas Instruments Incorporate
A 249Mpixel/s HEVC video-decoder chip for Quad Full HD applications
The latest video coding standard High Efficiency Video Coding (HEVC) provides 50% improvement in coding efficiency compared to H.264/AVC, to meet the rising demand for video streaming, better video quality and higher resolutions. The coding gain is achieved using more complex tools such as larger and variable-size coding units (CU) in a hierarchical structure, larger transforms and longer interpolation filters. This paper presents an integrated circuit which supports Quad Full HD (QFHD, 3840×2160) video decoding for the HEVC draft standard. It addresses new design challenges for HEVC (“H.265”) with three primary contributions: 1) a system pipelining scheme which adapts to the variable-size largest coding unit (LCU) and provides a two-stage sub-pipeline for memory optimization; 2) unified processing engines to address the hierarchical coding structure and many prediction and transform block sizes in area-efficient ways; 3) a motion compensation (MC) cache which reduces DRAM bandwidth for the LCU and meets the high throughput requirements which are due to the long filters.Texas Instruments Incorporate
An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for Securing Internet-of-Things Applications
This paper presents the first hardware implementation of the Datagram
Transport Layer Security (DTLS) protocol to enable end-to-end security for the
Internet of Things (IoT). A key component of this design is a reconfigurable
prime field elliptic curve cryptography (ECC) accelerator, which is 238x and 9x
more energy-efficient compared to software and state-of-the-art hardware
respectively. Our full hardware implementation of the DTLS 1.3 protocol
provides 438x improvement in energy-efficiency over software, along with code
size and data memory usage as low as 8 KB and 3 KB respectively. The
cryptographic accelerators are coupled with an on-chip low-power RISC-V
processor to benchmark applications beyond DTLS with up to two orders of
magnitude energy savings. The test chip, fabricated in 65 nm CMOS, demonstrates
hardware-accelerated DTLS sessions while consuming 44.08 uJ per handshake, and
0.89 nJ per byte of encrypted data at 16 MHz and 0.8 V.Comment: Published in IEEE Journal of Solid-State Circuits (JSSC
Does Fully Homomorphic Encryption Need Compute Acceleration?
The emergence of cloud-computing has raised important privacy questions about the data that users share with remote servers. While data in transit is protected using standard techniques like Transport Layer Security (TLS), most cloud providers have unrestricted plaintext access to user data at the endpoint. Fully Homomorphic Encryption (FHE) offers one solution to this problem by allowing for arbitrarily complex computations on encrypted data without ever needing to decrypt it. Unfortunately, all known implementations of FHE require the addition of noise during encryption which grows during computation. As a result, sustaining deep computations requires a periodic noise reduction step known as bootstrapping. The cost of the bootstrapping operation is one of the primary barriers to the wide-spread adoption of FHE.In this paper, we present an in-depth architectural analysis of the bootstrapping step in FHE. First, we observe that se-cure implementations of bootstrapping exhibit a low arithmetic intensity (100MB) and as such, are heavily bound by the main memory bandwidth.Consequently, we demonstrate that existing workloads observe marginal performance gains from the design of bespoke high-throughput arithmetic units tailored to FHE. Secondly, we propose several cache-friendly algorithmic optimizations that improve the throughput in FHE bootstrapping by enabling upto3.2Ă—higher arithmetic intensity and4.6Ă—lower memory bandwidth. Our optimizations apply to a wide range of structurally similar computations such as private evaluation and training of machine learning models. Finally, we incorporate these optimizations into an architectural tool which, given a cache size, memory subsystem, the number of functional units and a desired security level, selects optimal cryptosystem parameters to maximize the bootstrapping throughput.Our optimized bootstrapping implementation represents a best-case scenario for compute acceleration of FHE. We show that despite these optimizations, bootstrapping (as well as other applications with similar computational structure) continue to remain bottlenecked by main memory bandwidth. We thus conclude that secure FHE implementations need to look beyond accelerated compute for further performance improvements and to that end, we propose new research directions to address the underlying memory bottleneck. In summary, our answer to the titular question is: yes, but only after addressing the memory bottleneck